Session 1: Introduction, tidy data analysis and workflow
Professor Di Cook
Department of Econometrics and Business Statistics
About the presenter
Dianne Cook, Distinguished Professor, Monash University
🌐 https://dicook.org/
🦣 @visnut@aus.social
@visnut.bsky.social
I have a PhD in Statistics from Rutgers University, NJ, and a Bachelor of Science (Pure Mathematics, Statistics and Biochemistry) from the University of New England
I am a Fellow of the American Statistical Association, an elected member of the R Foundation and the International Statistical Institute, and a past editor of the Journal of Computational and Graphical Statistics and the R Journal.
My research is in data visualisation, statistical graphics and computing, with application to sports, ecology and bioinformatics. I like to develop new methodology and software.
Students in my lab work on methods and software that are generally useful for the world. They have been responsible for bringing you ggplot2, the tidyverse suite, knitr, plotly, and many other frequently used R packages.
Got a question, or a comment?
✋ 🔡 You can ask directly by raising your hand at any time.
I hope you have many questions! 🙋🏻👣
Outline
Follow along
Summary of materials
This workshop is designed to provide you with everyday tools to improve your data analysis efficiency and effectiveness. All the code and examples to reproduce everything discussed are available at https://dicook.github.io/BAPPENAS_2025/
Big data problems that are actually small data problems, once you have the right subset/sample/summary. ~90%
Big data problems that are actually lots and lots of small data problems, e.g. you need to fit one model per individual for thousands of individuals. ~9%
Finally, there are irretrievably big problems where you do need all the data, perhaps because you are fitting a complex model.
This applies at least when you first tackle a data problem; afterwards you might scale up and automate operations.
Approach
The methods and tools discussed will ensure you can get started and have a process to follow to develop the appropriate analysis.
Tidy data
Using tidyr, dplyr
Writing readable code using pipes
What is tidy data? Why do you want tidy data? Getting your data into tidy form using tidyr.
Reading different data formats
String operations, working with text
The pipe operator %>% or |>
read as "then"
x %>% f(y) and x |> f(y) are both equivalent to f(x, y)
%>% is part of the dplyr package (really, of magrittr); |> is part of base R
pipes structure code as a sequence of operations, as opposed to nested function calls like g(f(x))
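A minimal sketch of the equivalence, using only base R functions:

```r
# Nested call: apply sqrt first, then mean (read inside-out)
nested <- mean(sqrt(1:10))

# Piped: the same computation, read left-to-right as a sequence of steps
piped <- 1:10 |> sqrt() |> mean()

nested == piped  # TRUE
```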
The pipe operator %>% or |>
%>% is part of dplyr package (or more precisely, the magrittr package)
R 4.1 introduced the |> base pipe (no package necessary)
An explanation of the (subtle) differences between the pipes can be found here
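One of the subtle differences is that %>% accepts a bare function name on the right-hand side, while the base pipe requires an explicit call. A small sketch:

```r
library(magrittr)

# magrittr's %>% accepts a bare function name on the right-hand side
4 %>% sqrt     # 2

# the base pipe requires an explicit call, with parentheses
4 |> sqrt()    # 2

# 4 |> sqrt    # error: the right-hand side of |> must be a function call
```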
tb <- read_csv(here::here("data/TB_notifications_2025-07-22.csv"))
tb %>%                       # first we get the tb data
  filter(year == 2023) %>%   # then we focus on the most recent year
  group_by(country) %>%      # then we group by country
  summarize(
    cases = sum(c_newinc, na.rm = TRUE)  # to create a summary of all new cases
  ) %>%
  arrange(desc(cases))       # then we sort countries to show highest number of new cases first
tb <- read_csv(here::here("data/TB_notifications_2025-07-22.csv"))
tb |>                        # first we get the tb data
  filter(year == 2023) |>    # then we focus on the most recent year
  group_by(country) |>       # then we group by country
  summarize(
    cases = sum(c_newinc, na.rm = TRUE)  # to create a summary of all new cases
  ) |>
  arrange(desc(cases))       # then we sort countries to show highest number of new cases first
# A tibble: 215 × 2
country cases
<chr> <dbl>
1 India 2382714
2 Indonesia 804836
3 Philippines 575770
4 China 564918
5 Pakistan 475761
6 Nigeria 367250
7 Bangladesh 302813
8 Democratic Republic of the Congo 258069
9 South Africa 211810
10 Ethiopia 134873
# ℹ 205 more rows
What is tidy data?
Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst
# A tibble: 6 × 4
Inst AvNumPubs AvNumCits PctCompletion
<chr> <dbl> <dbl> <dbl>
1 ARIZONA STATE UNIVERSITY 0.9 1.57 31.7
2 AUBURN UNIVERSITY 0.79 0.64 44.4
3 BOSTON COLLEGE 0.51 1.03 46.8
4 BOSTON UNIVERSITY 0.49 2.66 34.2
5 BRANDEIS UNIVERSITY 0.3 3.03 48.7
6 BROWN UNIVERSITY 0.84 2.31 54.6
What’s in the column names of this data? What are the experimental units? What are the measured variables?
10 week sensory experiment, 12 individuals assessed taste of french fries on several scales (how potato-y, buttery, grassy, rancid, paint-y do they taste?), fried in one of 3 different oils, replicated twice.
What is the experimental unit? What are the factors of the experiment? What was measured? What do you want to know?
Messy data patterns
There are various features of messy data that one can observe in practice. Here are some of the more commonly observed patterns:
Column headers are not just variable names, but also contain values
Variables are stored in both rows and columns, contingency table format
One type of experimental unit stored in multiple tables
Dates in many different formats
Tidy Data Conventions
Data is contained in a single table
Each observation forms a row (no data info in column names)
Each variable forms a column (no mashup of multiple pieces of information)
Long and Wide
Long form: one measured value per row; all other variables are descriptors (key variables). Good for modelling, but awkward for many other computations, e.g. a correlation matrix.
Widest form: all measured values for an entity are in a single row.
Wide form: measurements are arranged by some of the descriptors in columns (for direct comparisons)
Illustrations from the Openscapes blog: Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst
Tidy verbs
pivot_longer: get information out of names into columns
pivot_wider: make columns of observed data for levels of design variables (for comparisons)
separate/unite: split and combine columns
nest/unnest: pack/unpack subsets of rows into sub-data frames stored in a list column
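A small sketch of separate and unite as inverses, on hypothetical toy data (assuming tidyr and tibble are loaded):

```r
library(tidyr)
library(tibble)

# Hypothetical column holding two pieces of information: cases/population
d <- tibble(rate = c("15/200", "7/150"))

# separate() splits one column into several
d2 <- d |>
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)

# unite() is the inverse: glue the columns back together
d3 <- d2 |>
  unite("rate", cases, population, sep = "/")
```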
Pivot to long form
data |> pivot_longer(cols, names_to = "name", values_to = "value", ...)
pivot_longer turns a wide format into a long format
two new variables are introduced (in key-value format): name and value
# A tibble: 6 × 7
country iso3 year toss_new toss_sp sexage value
<chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
1 Afghanistan AFG 1997 new sp m014 0
2 Afghanistan AFG 1997 new sp m1524 10
3 Afghanistan AFG 1997 new sp m2534 6
4 Afghanistan AFG 1997 new sp m3544 3
5 Afghanistan AFG 1997 new sp m4554 5
6 Afghanistan AFG 1997 new sp m5564 2
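As a small sketch on hypothetical toy data (assuming tidyr and tibble are loaded), pivoting year columns into rows:

```r
library(tidyr)
library(tibble)

# Hypothetical wide data: one column per year
pop <- tibble(
  country = c("Fiji", "Tonga"),
  `2020`  = c(896, 105),
  `2021`  = c(902, 106)
)

# Each year column becomes rows; names go to `year`, values to `population`
pop_long <- pop |>
  pivot_longer(cols = -country,
               names_to = "year",
               values_to = "population")
```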
Separate columns
data %>% separate_wider_position(col, widths, ...)
split the column col into a set of fixed-width columns specified by widths
widths is a named numeric vector where the names become column names; unnamed components are matched but not included in the output.
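A minimal sketch on hypothetical codes, mirroring the sex/age split used below (assuming tidyr and tibble are loaded):

```r
library(tidyr)
library(tibble)

# Hypothetical codes: 1 character of sex followed by up to 4 of age group
d <- tibble(sexage = c("m014", "f1524"))

d |>
  separate_wider_position(
    sexage,
    widths = c(sex = 1, age = 4),
    too_few = "align_start"  # "m014" has only 3 age characters
  )
```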
Separate TB notifications again
Now split sexage into first character (m/f) and rest.
tb3 <- tb2 %>%
  dplyr::select(-starts_with("toss")) |>  # remove the `toss` variables
  separate_wider_position(
    sexage,
    widths = c(sex = 1, age = 4),
    too_few = "align_start"
  )
tb3 |>
  na.omit() |>
  head()
# A tibble: 6 × 6
country iso3 year sex age value
<chr> <chr> <dbl> <chr> <chr> <dbl>
1 Afghanistan AFG 1997 m 014 0
2 Afghanistan AFG 1997 m 1524 10
3 Afghanistan AFG 1997 m 2534 6
4 Afghanistan AFG 1997 m 3544 3
5 Afghanistan AFG 1997 m 4554 5
6 Afghanistan AFG 1997 m 5564 2
Your turn
Read the genes data from the data folder. The column names contain data and are rather messy.